ABSTRACT



The present invention is embodied in an apparatus, and 
related method, for detecting and recognizing an object in an 
image frame. The object may be, for example, a head having 
particular facial characteristics. The object detection process 
uses robust and computationally efficient techniques. The 
object identification and recognition process uses an image 
processing technique based on model graphs and bunch 
graphs that efficiently represent image features as jets. The 
jets are composed of wavelet transforms and are processed 
at nodes or landmark locations on an image corresponding 
to readily identifiable features. The system of the invention 
is particularly advantageous for recognizing a person over a 
wide variety of pose angles. 

42 Claims, 21 Drawing Sheets 
FACE RECOGNITION FROM VIDEO IMAGES

CROSS-REFERENCES TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. §119(e)(1) and 37 C.F.R. §1.78(a)(4) to U.S. provisional application serial number 60/081,615 filed Apr. 13, 1998 and titled VISION ARCHITECTURE TO DESCRIBE FEATURES OF PERSONS.

FIELD OF THE INVENTION

The present invention relates to vision-based object detection and tracking, and more particularly, to systems for detecting objects in video images, such as human faces, and tracking and identifying the objects in real time.

BACKGROUND OF THE INVENTION

Recently developed object and face recognition techniques include the use of elastic bunch graph matching. The bunch graph recognition technique is highly effective for recognizing faces when the image being analyzed is segmented such that the face portion of the image occupies a substantial portion of the image. However, the elastic bunch graph technique may not reliably detect objects in a large scene where the object of interest occupies only a small fraction of the scene. Moreover, for real-time use of the elastic bunch graph recognition technique, the process of segmenting the image must be computationally efficient or many of the performance advantages of the recognition technique are not obtained.

Accordingly, there exists a significant need for an image processing technique for detecting an object in video images and preparing the video image for further processing by a bunch graph matching process in a computationally efficient manner. The present invention satisfies these needs.

SUMMARY OF THE INVENTION

The present invention is embodied in an apparatus, and related method, for detecting and recognizing an object in an image frame. The object detection process uses robust and computationally efficient techniques. The object identification and recognition process uses an image processing technique based on model graphs and bunch graphs that efficiently represent image features as jets. The system of the invention is particularly advantageous for recognizing a person over a wide variety of pose angles.

In an embodiment of the invention, the object is detected and a portion of the image frame associated with the object is bounded by a bounding box. The bound portion of the image frame is transformed using a wavelet transformation to generate a transformed image. Nodes associated with distinguishing features of the object defined by wavelet jets of a bunch graph generated from a plurality of representative object images are located on the transformed image. The object is identified based on a similarity between wavelet jets associated with an object image in a gallery of object images and wavelet jets at the nodes on the transformed image.

Additionally, the detected object may be sized and centered within the bound portion of the image such that the detected object has a predetermined size and location within the bound portion, and background portions of the bound portion of the image frame not associated with the object may be suppressed prior to identifying the object. Often, the object is a head of a person exhibiting a facial region. The bunch graph may be based on a three-dimensional representation of the object. Further, the wavelet transformation may be performed using phase calculations that are performed using a hardware adapted phase representation.

In an alternative embodiment of the invention, the object is in a sequence of images and the step of detecting an object further includes tracking the object between image frames based on a trajectory associated with the object. Also, the step of locating the nodes includes tracking the nodes between image frames and reinitializing a tracked node if the node's position deviates beyond a predetermined position constraint between image frames. Additionally, the image frames may be stereo images and the step of detecting may include detecting convex regions which are associated with head movement.

Other features and advantages of the present invention should be apparent from the following description of the preferred embodiments, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a face recognition process, according to the invention.

FIG. 2 is a block diagram of a face recognition system, according to the invention.

FIG. 3 is a series of images for showing detection, finding and identification processes of the recognition process of FIG. 1.

FIG. 4 is a block diagram of the head detection and tracking process, according to the invention.

FIG. 5 is a flow chart, with accompanying images, for illustrating a disparity detection process according to the invention.

FIG. 6 is a schematic diagram of a convex detector, according to the invention.

FIG. 7 is a flow chart of a head tracking process, according to the invention.

FIG. 8 is a flow chart of a preselector, according to the invention.

FIG. 9 is a flow chart, with accompanying photographs, for illustrating a landmark finding technique of the facial recognition apparatus and system of FIG. 1.

FIG. 10 is a series of images showing processing of a facial image using Gabor wavelets, according to the invention.

FIG. 11 is a series of graphs showing the construction of a jet, image graph, and bunch graph using the wavelet processing technique of FIG. 10, according to the invention.

FIG. 12 is a diagram of a model graph, according to the invention, for processing facial images.

FIG. 13 includes two diagrams showing the use of wavelet processing to locate facial features.

FIG. 14 is a diagram of a face with extracted eye and mouth regions, for illustrating a coarse-to-fine landmark finding technique.

FIG. 15 is a schematic diagram illustrating a circular behavior of phase.

FIG. 16 are schematic diagrams illustrating a two's complement representation of phase having a circular behavior, according to the invention.

FIG. 17 is a flow diagram showing a tracking technique for tracking landmarks found by the landmark finding technique of the invention.
FIG. 18 is a series of facial images showing tracking of facial features, according to the invention.

FIG. 19 is a diagram of a gaussian image pyramid technique for illustrating landmark tracking in one dimension.

FIG. 20 is a series of two facial images, with accompanying graphs of pose angle versus frame number, showing tracking of facial features over a sequence of 50 image frames.

FIG. 21 is a flow diagram, with accompanying photographs, for illustrating a pose estimation technique of the recognition apparatus and system of FIG. 1.

FIG. 22 is a graph of a pinhole camera model showing the orientation of three-dimensional (3-D) view access.

FIG. 23 is a perspective view of a 3-D camera calibration configuration.

FIG. 24 is a schematic diagram of rectification for projecting corresponding pixels of stereo images along the same line numbers.

FIG. 25 are image frames showing a correlation matching process between a window of one image frame and a search window of the other image frame.

FIG. 26 are images of a stereo image pair, disparity map and image reconstruction illustrating 3-D image decoding.

FIG. 27 is a flow chart of an image identification process, according to the invention.

FIG. 28 is an image showing the use of background suppression.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is embodied in a method, and related apparatus, for detecting and recognizing an object in an image frame. The object may be, for example, a head having particular facial characteristics. The object detection process uses robust and computationally efficient techniques. The object identification and recognition process uses an image processing technique based on model graphs and bunch graphs that efficiently represent image features as jets. The jets are composed of wavelet transforms and are processed at nodes or landmark locations on an image corresponding to readily identifiable features. The system of the invention is particularly advantageous for recognizing a person over a wide variety of pose angles.

An image processing system of the invention is described with reference to FIGS. 1-3. The object recognition process 10 operates on digitized video image data provided by an image processing system 12. The image data includes an image of an object class, such as a human face. The image data may be a single video image frame or a series of sequential monocular or stereo image frames.

Before processing a facial image using elastic bunch graph techniques, the head in the image is roughly located, in accordance with the invention, using a head detection and tracking process 14. Depending on the nature of the image data, the head detection module uses one of a variety of visual pathways which are based on, for example, motion, color, or size (stereo vision), topology or pattern. The head detection process places a bounding box around the detected head thus reducing the image region that must be processed by the landmark finding process. Based on data received from the head detection and tracking process, a preselector process 16 selects the most suitable views of the image material for further analysis and refines the head detection to center and scale the head image. The selected head image is provided to a landmark finding process 18 for detecting the individual facial features using the elastic bunch graph technique. Once facial landmarks have been found on the facial image, a landmark tracking process 20 may be used to track the landmarks. The features extracted at the landmarks are then compared against corresponding features extracted from gallery images by an identifier process 22. This division of the image recognition process is advantageous because the landmark finding process is relatively time-consuming and often may not be performed in real time on a series of image frames having a relatively high frame rate. Landmark tracking, on the other hand, may be performed faster than frame rate. Thus, while the initial landmark finding process is occurring, a buffer may be filled with new incoming image frames. Once the landmarks are located, landmark tracking is started and the processing system may catch up by processing the buffered images until the buffer is cleared. Note that the preselector and the landmark tracking module may be omitted from the face recognition process.

Screen output of the recognition process is shown in FIG. 3 for the detection, landmark finding and identifier processes. The upper left image window shows an acquired image with the detected head indicated by a bounding rectangle. The head image is centered, resized, and provided to the landmark finding process. The upper right image window shows the output of the landmark finding module with the facial image marked with nodes on the facial landmarks. The marked image is provided to the identifier process, which is illustrated in the lower window. The leftmost image represents the selected face provided by the landmark finding process for identification. The three right-most faces represent the most similar gallery images sorted in the order of similarity with the most similar face being in the left-most position. Each gallery image carries a tag (e.g., id number and person name) associated with the image. The system then reports the tag associated with the most similar face.

The face recognition process may be implemented using a three dimensional (3D) reconstruction process 24 based on stereo images. The 3D face recognition process provides viewpoint independent recognition.

The image processing system 12 for implementing the face recognition processes of the invention is shown in FIG. 2. The processing system receives a person's image from a video source 26 which generates a stream of digital video image frames. The video image frames are transferred into a video random-access memory (VRAM) 28 for processing. A satisfactory imaging system is the Matrox Meteor II available from Matrox™ (Dorval, Quebec, Canada; www.matrox.com) which generates digitized images produced by a conventional CCD camera and transfers the images in real-time into the memory at a frame rate of 30 Hz. A typical resolution for an image frame is 256 pixels by 256 pixels. The image frame is processed by an image processor having a central processing unit (CPU) 30 coupled to the VRAM and random-access memory (RAM) 32. The RAM stores program code 34 and data for implementing the facial recognition processes of the invention. Alternatively, the image processing system may be implemented in application specific hardware.

The head detection process is described in more detail with reference to FIG. 4. The facial image may be stored in VRAM 28 as a single image 36, a monocular video stream of images 38 or a binocular video stream of images 40.

For a single image, processing time may not be critical and elastic bunch graph matching, described in more detail
below, may be used to detect a face if the face covers at least
10% of the image and has a diameter of at least 50 pixels. 
If the face is smaller than 10% of the image or if multiple 
faces are present, a neural network based face detector may 
be used as described in H. A. Rowley, S. Baluja and T. 
Kanade, "Rotation Invariant Neural Network-Based Face 
Detection", Proceedings Computer Vision and Pattern 
Recognition, 1998. If the image includes color information, 
a skin color detection process may be used to increase the 
reliability of the face detection. The skin color detection 
process may be based on a look-up table that contains 
possible skin colors. Confidence values which indicate the 
reliability of face detection and which are generated during 
bunch graph matching or by the neural network, may be 
increased for skin-colored image regions. 

A monocular image stream of at least 10 frames per 
second may be analyzed for image motion, particularly if the 
image stream includes only a single person that is moving in 
front of a stationary background. One technique for head 
tracking involves the use of difference images to determine 
which regions of an image have been moving. 
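
The difference-image idea can be sketched as follows. This is a minimal illustration only: the function names, the nested-list image representation, and the threshold of 20 gray levels are assumptions and are not taken from the specification.

```python
def motion_mask(prev_frame, curr_frame, threshold=20):
    """Binary difference image: 1 where the gray value changed by more
    than `threshold` between two frames, 0 elsewhere.
    The threshold value is an illustrative assumption."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def moving_pixel_count(mask):
    """Number of pixels flagged as moving."""
    return sum(sum(row) for row in mask)


# Two tiny gray value frames; the middle column changes between them.
prev_frame = [[10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 90, 10], [10, 95, 12]]
mask = motion_mask(prev_frame, curr_frame)
```

The count of moving pixels is the kind of quantity that could be compared against a minimal threshold before a more expensive analysis is invoked.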

As described in more detail below with respect to binocular 
images, head motion often results in a difference 
image having convex regions within a motion silhouette. 
This motion silhouette technique can readily locate and track 
head motion if the image includes a single person in an upright 
position in front of a static background. A clustering algorithm 
groups moving regions into clusters. The top of the 
highest cluster that exceeds a minimal threshold size and 
diameter is considered the head and marked. 

Another advantageous use of head motion detection uses 
graph matching which is invoked only when the number of 
pixels affected by image motion exceeds a minimal thresh- 
old. The threshold is selected such that the relatively time 
consuming graph matching image analysis is performed 
only if sufficient change in the image justifies a renewed 
in-depth analysis. Other techniques for determining convex 
regions of a noisy motion silhouette may be used such as, for 
example, Turk et al., "Eigenfaces for Recognition", Journal 
of Cognitive Neuroscience, Vol. 3, No. 1, p. 71, 1991. Optical 
flow methods, as described in D. J. Fleet, "Measurement of 
Image Velocity", Kluwer International Series in Engineering 
and Computer Science, No. 169, 1992, provide an alterna- 
tive and reliable means to determine which image regions 
change but are computationally more intensive. 

With reference to FIG. 5, reliable and fast head and face 
detection is possible using an image stream of stereo bin- 
ocular video images (block 50). Stereo vision allows for 
discrimination between foreground and background objects 
and it allows for determining object size for objects of a 
known size, such as heads and hands. Motion is detected 
between two images in an image series by applying a 
difference routine to the images in both the right image 
channel and the left image channel (block 52). A disparity 
map is computed for the pixels that move in both image 
channels (block 54). The convex detector next uses disparity 
histograms (block 56) that show the number of pixels 
against the disparity. The image regions having a disparity 
confined to a certain disparity interval are selected by 
inspecting the local maxima of the disparity histogram 
(block 58). The pixels associated with a local maximum are 
referred to as motion silhouettes. The motion silhouettes are 
binary images. 
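
A minimal sketch of the histogram step is given below. It assumes integer disparities, marks non-moving pixels with `None`, and uses a simple neighbor comparison to find peaks; the helper names, the `min_count` filter, and the `half_width` of the disparity interval are all illustrative assumptions rather than details from the specification.

```python
from collections import Counter


def disparity_histogram(disparity_map):
    """Count moving pixels per integer disparity (None = pixel not moving)."""
    return Counter(d for row in disparity_map for d in row if d is not None)


def local_maxima(hist, min_count=1):
    """Disparities with at least `min_count` pixels whose count is strictly
    above the left neighbor bin and not below the right neighbor bin."""
    peaks = []
    for d in sorted(hist):
        left, right = hist.get(d - 1, 0), hist.get(d + 1, 0)
        if hist[d] >= min_count and hist[d] > left and hist[d] >= right:
            peaks.append(d)
    return peaks


def motion_silhouette(disparity_map, peak, half_width=1):
    """Binary image of the pixels whose disparity lies within
    `half_width` of the selected histogram peak."""
    return [
        [1 if d is not None and abs(d - peak) <= half_width else 0 for d in row]
        for row in disparity_map
    ]


# Toy disparity map: a near object (disparity about 12) and a farther one (about 4).
disparity_map = [
    [None, 4, 4, None],
    [12, 4, 5, 12],
    [12, 12, None, 3],
]
hist = disparity_histogram(disparity_map)
peaks = local_maxima(hist, min_count=2)
far_silhouette = motion_silhouette(disparity_map, 4)
```

Each peak yields one binary silhouette, so objects at different depths end up in separate images.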

Some motion silhouettes may be discarded as too small to 
be generated by a person (block 60). The motion silhouette 
associated with a given depth may distinguish a person from 
other moving objects (block 62). 
The convex regions of the motion silhouette (block 64) 
are detected by a convex detector as shown in FIG. 6. The 
convex detector analyzes convex regions within the silhouettes. 
The convex detector checks whether a pixel 68 that 
belongs to a motion silhouette has neighboring pixels 
that are within an allowed region 70 on the circumference or 
width of the disparity 72. The connected allowed region can 
be located in any part of the circumference. The output of the 
convex detector is a binary value. 

Skin color silhouettes may likewise be used for detecting 
heads and hands. The motion silhouettes, skin color 
silhouettes, outputs of the convex detectors applied to the 
motion silhouettes and outputs of the convex detectors 
applied to the skin color silhouettes, provide four different 
evidence maps. An evidence map is a scalar function over 
the image domain that indicates the evidence that a certain 
pixel belongs to a face or a hand. Each of the four evidence 
maps is binary valued. The evidence maps are linearly 
superimposed for a given disparity and checked for local 
maxima. The local maxima indicate candidate positions 
where heads or hands might be found. The expected diameter 
of a head then may be inferred from the local maximum 
in the disparity map that gave rise to the evidence map. Head 
detection as described performs well even in the presence of 
strong background motion. 
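
The linear superposition of the four binary evidence maps can be sketched as below. Equal weights, the 8-neighbor test for local maxima, and the `min_evidence` cutoff are assumptions made for illustration; the specification does not state them.

```python
def superimpose(maps, weights=None):
    """Weighted sum of equally sized binary evidence maps (equal weights assumed)."""
    weights = weights or [1.0] * len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, maps)) for c in range(cols)]
        for r in range(rows)
    ]


def candidate_positions(evidence, min_evidence=2.0):
    """(row, col) cells that reach `min_evidence` and are not exceeded
    by any of their 8 neighbors."""
    rows, cols = len(evidence), len(evidence[0])
    found = []
    for r in range(rows):
        for c in range(cols):
            value = evidence[r][c]
            if value < min_evidence:
                continue
            neighbors = [
                evidence[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value >= n for n in neighbors):
                found.append((r, c))
    return found


motion = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
skin = [[0, 1, 0], [0, 0, 0], [0, 0, 1]]
motion_convex = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
skin_convex = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]

evidence = superimpose([motion, skin, motion_convex, skin_convex])
candidates = candidate_positions(evidence)
```

Only the cell supported by all four maps survives as a candidate in this toy example; isolated single-map responses fall below the cutoff.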

The head tracking process (block 42) generates head 
position information that may be used to generate head 
trajectory checking. As shown in FIG. 7, newly detected 
head positions (block 78) may be compared with existing 
head trajectories. A thinning (block 80) takes place that 
replaces multiple nearby detections by a single representative 
detection (block 82). The new position is checked to 
determine whether the new estimated position belongs to an 
already existing trajectory (block 84) assuming spatio-temporal 
continuity. For every position estimate found for 
the frame acquired at time t, the algorithm looks (block 86) 
for the closest head position estimate that was determined 
for the previous frame at time t-1 and connects it (block 88). 
If an estimate that is sufficiently close cannot be found, it is 
assumed that a new head appeared (block 90) and a new 
trajectory is started. To connect individual estimates to 
trajectories, only image coordinates are used. 
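
The linking step can be sketched as a nearest-neighbor search over the positions from the previous frame. The function name, the list-of-points trajectory representation, and the `max_distance` gate of 30 pixels are assumptions for illustration; the sketch also assumes thinning has already reduced nearby detections to one representative each.

```python
import math


def link_detections(trajectories, detections, max_distance=30.0):
    """Attach each detection at time t to the trajectory whose position at
    time t-1 is closest in image coordinates; start a new trajectory when
    no previous position lies within `max_distance` pixels."""
    # Snapshot the positions from the previous frame before appending anything.
    previous = [(index, traj[-1]) for index, traj in enumerate(trajectories)]
    for detection in detections:
        best_index, best_distance = None, max_distance
        for index, last_position in previous:
            distance = math.dist(detection, last_position)
            if distance <= best_distance:
                best_index, best_distance = index, distance
        if best_index is None:
            trajectories.append([detection])  # a new head appeared
        else:
            trajectories[best_index].append(detection)
    return trajectories


trajectories = [[(100, 100)], [(200, 50)]]
detections = [(104, 103), (400, 300)]
link_detections(trajectories, detections)
```

The first detection lies 5 pixels from an existing trajectory and extends it; the second is far from every known head and opens a new trajectory.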

Every trajectory is assigned a confidence which is updated 
using a leaky integrator. If the confidence value falls below 
a predetermined threshold, the trajectory is deleted (block 
92). A hysteresis mechanism is used to stabilize trajectory 
creation and deletion. In order to initiate a trajectory (block 
90), a higher confidence value must be reached than is 
necessary to delete a trajectory. 
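
A leaky integrator with hysteresis can be sketched as follows. The class name, the leak factor, and both threshold values are hypothetical; the only property carried over from the description is that the creation threshold is higher than the deletion threshold.

```python
class TrajectoryConfidence:
    """Leaky-integrator confidence with separate creation and deletion
    thresholds. All numeric defaults are illustrative assumptions."""

    def __init__(self, leak=0.5, create_threshold=0.6, delete_threshold=0.2):
        if create_threshold <= delete_threshold:
            raise ValueError("creation threshold must exceed deletion threshold")
        self.leak = leak
        self.create_threshold = create_threshold
        self.delete_threshold = delete_threshold
        self.value = 0.0
        self.active = False

    def update(self, evidence):
        """Blend new evidence (0..1) into the running confidence and
        return whether the trajectory is currently alive."""
        self.value = self.leak * self.value + (1.0 - self.leak) * evidence
        if not self.active and self.value >= self.create_threshold:
            self.active = True   # trajectory initiated
        elif self.active and self.value < self.delete_threshold:
            self.active = False  # trajectory deleted
        return self.active


confidence = TrajectoryConfidence()
states = [confidence.update(e) for e in (1.0, 1.0, 0.0, 0.0)]
```

With these values the confidence runs 0.5, 0.75, 0.375, 0.1875: the trajectory is created on the second detection, survives one missed frame because 0.375 is still above the deletion threshold, and is dropped after the second miss.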

The preselector 16 (FIG. 2) operates to select suitable 
images for recognition from a series of images belonging to 
the same trajectory. This selection is particularly useful if the 
computational power of the hardware is not sufficient to 
analyze each image of a trajectory individually. However, if 
available computation power is sufficient to analyze all faces 
found it may not be necessary to employ the preselector. 

The preselector 16 receives input from the head tracking 
process 14 and provides output to the landmark finding 
process 18. The input may be: 

A monocular gray value image of 256x256 pixel size 
represented by a 2 dimensional array of bytes. 
An integer number representing the sequence number of 
the image. This number is the same for all images 
belonging to the same sequence. 
Four integer values representing the pixel coordinates of 
the upper left and lower right corners of a square-shaped 
bounding rectangle that surrounds the face. 
The preselector's output may be:

Selected monocular gray value image from the previous sequence.
Four integer values representing the pixel coordinates of the upper left and lower right corners of a square-shaped bounding rectangle that represents the face position in a more accurate way compared to the rectangle that Preselector accepts as input.

As shown in FIG. 8, the preselector 16 processes a series of face candidates that belong to the same trajectory as determined by the head tracking process 14 (block 100). Elastic bunch graph matching, as described below with respect to landmark finding, is applied (block 102) to this sequence of images that contain an object of interest (e.g. the head of a person) in order to select the most suitable images for further processing (i.e. Landmark finding/Recognition). The preselector applies graph matching in order to evaluate each image by quality. Additionally, the matching result provides more accurate information about the position and size of the face than the head detection module. Confidence values generated by the matching procedure are used as a measure of suitability of the image. Preselector submits an image to the next module if its confidence value exceeds the best confidence value measured so far in the current sequence (block 104-110). The preselector bounds the detected image by a bounding box and provides the image to the landmark finding process 18. The subsequent process starts processing on each incoming image but terminates if an image having a higher confidence value (measured by the preselector) comes from within the same sequence. This may lead to increased CPU workload but yields preliminary results faster.

Accordingly, the Preselector filters out a set of most suitable images for further processing. The preselector may alternatively evaluate the images as follows:

The subsequent modules (e.g. landmarker, identifier) wait until the sequence has finished in order to select the last and therefore most promising image approved by preselector. This leads to low CPU workload but implies a time delay until the final result (e.g. recognition) is available.
The subsequent modules take each image approved by preselector, evaluate it individually, and leave final selection to the following modules (e.g. by recognition confidence). This also yields fast preliminary results. The final recognition result in this case may change within one sequence, yielding in the end better recognition rate. However, this approach requires the most amount of CPU time among the three evaluation alternatives.

The facial landmarks and features of the head may be located using an elastic graph matching technique shown in FIG. 9. In the elastic graph matching technique, a captured image (block 140) is transformed into Gabor space using a wavelet transformation (block 142) which is described below in more detail with respect to FIG. 10. The transformed image (block 144) is represented by 40 complex values, representing wavelet components, per each pixel of the original image. Next, a rigid copy of a model graph, which is described in more detail below with respect to FIG. 12, is positioned over the transformed image at varying model node positions to locate a position of optimum similarity (block 146). The search for the optimum similarity may be performed by positioning the model graph in the upper left hand corner of the image, extracting the jets at the nodes, and determining the similarity between the image graph and the model graph. The search continues by sliding the model graph left to right starting from the upper-left corner of the image (block 148). When a rough position of the face is found (block 150), the nodes are individually allowed to move, introducing elastic graph distortions (block 152). A phase-insensitive similarity function, discussed below, is used in order to locate a good match (block 154). A phase-sensitive similarity function is then used to locate a jet with accuracy because the phase is very sensitive to small jet displacements. The phase-insensitive and the phase-sensitive similarity functions are described below with respect to FIGS. 10-13. Note that although the graphs are shown in FIG. 9 with respect to the original image, the model graph movements and matching are actually performed on the transformed image.

The wavelet transform is described with reference to FIG. 10. An original image is processed using a Gabor wavelet to generate a convolution result. The Gabor-based wavelet consists of a two-dimensional complex wave field modulated by a Gaussian envelope,

$$\psi_{\vec{k}}(\vec{x}) = \frac{k^2}{\sigma^2}\, e^{-\frac{x^2 k^2}{2\sigma^2}} \left\{ e^{i\vec{k}\cdot\vec{x}} - e^{-\frac{\sigma^2}{2}} \right\} \tag{1}$$

The wavelet is a plane wave with wave vector $\vec{k}$ restricted by a Gaussian window, the size of which relative to the wavelength is parameterized by $\sigma$. The term in the brace removes the DC component. The amplitude of the wavevector $k$ may be chosen as follows, where $\nu$ is related to the desired spatial resolutions,

$$k_\nu = 2^{-\frac{\nu+2}{2}}\,\pi, \quad \nu = 1, 2, \ldots \tag{2}$$

A wavelet, centered at image position $\vec{x}$, is used to extract the wavelet component $J_{\vec{k}}$ from the image with gray level distribution $I(\vec{x})$,

$$J_{\vec{k}}(\vec{x}) = \int d^2x'\, I(\vec{x}')\, \psi_{\vec{k}}(\vec{x} - \vec{x}') \tag{3}$$

The space of wave vectors $\vec{k}$ is typically sampled in a discrete hierarchy of 5 resolution levels (differing by half-octaves) and 8 orientations at each resolution level (see, e.g., FIG. [illegible]), thus generating 40 complex values for each sampled image point (the real and imaginary components referring to the cosine and sine phases of the plane wave). The samples in k-space are designated by the index j = 1, ..., 40 and all wavelet components centered in a single image point are considered as a vector which is called a jet 160. Each jet describes the local features of the area surrounding $\vec{x}$. If sampled with sufficient density, the image may be reconstructed from jets within the bandpass covered by the sampled frequencies. Thus, each component of a jet is the filter response of a Gabor wavelet extracted at a point (x, y) of the image.

A labeled image graph 162, as shown in FIG. 11, is used to describe the aspects of an object (in this context, a face). The nodes 164 of the labeled graph refer to points on the object and are labeled by jets 160. Edges 166 of the graph are labeled with distance vectors between the nodes. Nodes and edges define the graph topology. Graphs with equal geometry may be compared. The normalized dot product of the absolute components of two jets defines the jet similarity. This value is independent of the illumination and contrast changes. To compute the similarity between two graphs, the sum is taken over similarities of corresponding jets between the graphs.
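
The jet extraction and the phase-insensitive jet similarity described in this passage can be sketched in pure Python as below. This is an illustrative sketch under stated assumptions, not the patented implementation: it assumes a Gabor kernel of the standard form (Gaussian envelope times a DC-corrected plane wave), a value of sigma equal to 2π, wave-vector magnitudes spaced by half-octaves with the level index running from 1 to 5, orientations spaced evenly over 180 degrees, and a truncated square window in place of a full convolution. All function names are hypothetical.

```python
import cmath
import math

SIGMA = 2 * math.pi  # assumed width parameter; not stated in the text


def gabor(kx, ky, dx, dy):
    """Gabor kernel value at offset (dx, dy) for wave vector (kx, ky)."""
    k2 = kx * kx + ky * ky
    envelope = (k2 / SIGMA ** 2) * math.exp(-k2 * (dx * dx + dy * dy) / (2 * SIGMA ** 2))
    # The subtracted constant is the DC-removal term.
    return envelope * (cmath.exp(1j * (kx * dx + ky * dy)) - math.exp(-SIGMA ** 2 / 2))


def wave_vectors(levels=5, orientations=8):
    """5 resolution levels x 8 orientations = 40 wave vectors (assumed sampling)."""
    vectors = []
    for nu in range(1, levels + 1):
        k = 2 ** (-(nu + 2) / 2) * math.pi  # half-octave spacing
        for mu in range(orientations):
            angle = mu * math.pi / orientations
            vectors.append((k * math.cos(angle), k * math.sin(angle)))
    return vectors


def jet(image, px, py, radius=8):
    """40 complex filter responses at pixel (px, py); the window is truncated
    to `radius` pixels, which clips the widest kernels (a simplification)."""
    height, width = len(image), len(image[0])
    components = []
    for kx, ky in wave_vectors():
        total = 0j
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                total += image[y][x] * gabor(kx, ky, px - x, py - y)
        components.append(total)
    return components


def jet_similarity(jet_a, jet_b):
    """Normalized dot product of the jets' magnitude vectors."""
    a = [abs(c) for c in jet_a]
    b = [abs(c) for c in jet_b]
    norm = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return sum(p * q for p, q in zip(a, b)) / norm if norm else 0.0


# A small synthetic texture and a copy with doubled contrast.
image = [[(7 * x + 3 * y) % 11 for x in range(17)] for y in range(17)]
brighter = [[2 * value for value in row] for row in image]
jet_original = jet(image, 8, 8)
jet_brighter = jet(brighter, 8, 8)
similarity = jet_similarity(jet_original, jet_brighter)
```

Because the similarity is normalized, scaling the image contrast leaves it unchanged, which illustrates the stated independence from contrast changes.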
A model graph 168 that is particularly designed for finding a human face in an image is shown in FIG. 12. The numbered nodes of the graph have the following locations:

0 right eye pupil
1 left eye pupil
2 top of the nose
3 right corner of the right eyebrow
4 left corner of the right eyebrow
5 right corner of the left eyebrow
6 left corner of the left eyebrow
7 right nostril
8 tip of the nose
9 left nostril
10 right corner of the mouth
11 center of the upper lip
12 left corner of the mouth
13 center of the lower lip
14 bottom of the right ear
15 top of the right ear
16 top of the left ear
17 bottom of the left ear

To represent a face, a data structure called bunch graph 170 is used. It is similar to the graph described above, but instead of attaching only a single jet to each node, a whole bunch of jets 172 (a bunch jet) are attached to each node. Each jet is derived from a different facial image. To form a bunch graph, a collection of facial images (the bunch graph gallery) is marked with node locations at defined positions of the head. These defined positions are called landmarks. When matching a bunch graph to an image, each jet extracted from the image is compared to all jets in the corresponding bunch attached to the bunch graph and the best-matching one is selected. This matching process is called elastic bunch graph matching. When constructed using a judiciously selected gallery, a bunch graph covers a great variety of faces that may have significantly different local properties.

In order to find a face in an image frame, the graph is moved and scaled over the image frame until a place is located at which the graph matches best (the best fitting jets within the bunch jets are most similar to jets extracted from the image at the current positions of the nodes). Since face features differ from face to face, the graph is made more general for the task, e.g., each node is assigned with jets of the corresponding landmark taken from 10 to 100 individual faces.

If the graphs have relative distortion, a second term that accounts for geometrical distortions may be introduced. Two different jet similarity functions are used for two different, or

comparing two jets during graph matching, the similarity between them is maximized with respect to d, leading to an accurate determination of jet position. Both similarity functions are used, with preference often given to the phase-insensitive version (which varies smoothly with relative position), when first matching a graph, and given to the phase-sensitive version when accurately positioning the jet.

A coarse-to-fine landmark finding approach, shown in FIG. 14, uses graphs having fewer nodes and kernel on lower resolution images. After coarse landmark finding has been achieved, higher precision localization may be performed on higher resolution images for precise finding of a particular facial feature.

The responses of Gabor convolutions are complex numbers which are usually stored as absolute and phase values because comparing Gabor jets may be performed more efficiently if the values are represented in that domain rather than in the real-imaginary domain. Typically the absolute and phase values are stored as 'float' values. Calculations are then performed using float-based arithmetic. The phase value ranges within a range of -π to π where -π equals π so that the number distribution can be displayed on a circular axis as shown in FIG. 15. Whenever the phase value exceeds this range, i.e. due to an addition or subtraction of a constant phase value, the resulting value must be readjusted to within this range which requires more computational effort than the float-addition alone.

The commonly used integer representation and related arithmetic provided by most processors is the two's complement. Since this value has a finite range, overflow or underflow may occur in addition and subtraction operations. The maximum positive number of a 2-byte integer is 32767. Adding 1 yields a number that actually represents -32768. Hence the arithmetic behavior of the two's complement integer is very close to the requirements for phase arithmetic. Therefore, we may represent phase values by 2-byte integers. Phase values φ are mapped into integer values I as shown in FIG. 16. The value in the range of -π to π is rarely required during matching and comparison stages described later. Therefore the mapping between [-π, π] and [-32768, 32768] does not need to be computed very often. However phase additions and subtractions occur very often. These compute much faster using the processor adapted interval. Therefore this adaptation technique can significantly improve the calculation speed of the processor.

After the facial features and landmarks are located, the facial features may be tracked over consecutive frames as illustrated in FIGS. 17 and 18. The tracking technique of the

even complementary, tasks. If the components of a jet ? are invention achieves robust tracking over long frame 

written in the form with amplitude a, and phase the 50 sequencesby usmg a trackmg correction scheme mat detects 

_^ _„ ' ' whether tracking of a feature or node has been lost and 

similarity of two jets T and J ' is the normalized scalar reinitializes the tracking process for that node, 

product of the amplitude vector: The position X_n of a single node in an image I_n of an 

image sequence is known either by landmark finding on 

v Yi a Pi <4) 55 mia S e n using the landmark finding method (block 180) 

' ) - i . = described above, or by tracking the node from image I__(n- 

v% j£ a j 1) to L_n using the tracking process. The node is then 

tracked (block 182) to a corresponding position XL_(n+l) in 

t, • ., c. *- u .u c the image I (n+1) by one of several techniques. The track- 

The other similarity function has the form * *~ i n. * u i j_ . i j * 

60 ing methods described below advantageously accommodate 

. fast motion. 

Zj a i a 'f xs ^i ~ &j ~ dk i) A first tracking technique involves linear motion predic- 

yjz ajZ x * 011, s^rch f° r me corresponding node position X_(n+ 

' ' 1) in the new image I__(n+1) is started at a position gener- 

65 ated by a motion estimator. A disparity vector (X_n-X_ 

This function includes a relative displacement vector (n-1)) is calculated that represents the displacement, 

between the image points to which the two jets refer. When assuming constant velocity, of the node between the pre- 
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ceeding two frames.. The. disparity or displacement vector 
D_n may be added to the position X_n to predict the node 
position X_(n+1). This linear motion model is particularly 
advantageous for accommodating constant velocity motion. 
The linear motion model also provides good tracking if the 
frame rate is high compared to the acceleration of the objects 
being tracked. However, the linear motion model performs 
poorly if the frame rate is too low so that strong acceleration 
of the objects occurs between frames in the image sequence. 
Because it is difficult for any motion model to track objects 
under such conditions, use of a camera having a higher
frame rate is recommended.

The linear motion model may generate too large of an
estimated motion vector D_n which could lead to an
accumulation of the error in the motion estimation.
Accordingly, the linear prediction may be damped using a
damping factor f_D. The resulting estimated motion vector
is D_n = f_D*(X_n-X_(n-1)). A suitable damping factor
is 0.9. If no previous frame I_(n-1) exists, e.g., for a frame
immediately after landmark finding, the estimated motion
vector is set equal to zero (D_n = 0).
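
As an illustration, the damped linear prediction can be sketched in a few lines of Python. The function name `predict_next_position` and its argument names are hypothetical and do not appear in the source; only the damping factor of 0.9 and the zero-motion fallback are taken from the text.

```python
import numpy as np


def predict_next_position(x_curr, x_prev=None, damping=0.9):
    """Predict X_(n+1) = X_n + D_n with D_n = f_D * (X_n - X_(n-1)).

    x_curr, x_prev: 2-D node positions in frames n and n-1 (hypothetical names).
    damping: the damping factor f_D (0.9 is the value suggested in the text).
    """
    x_curr = np.asarray(x_curr, dtype=float)
    if x_prev is None:
        # No previous frame, e.g. right after landmark finding: D_n = 0.
        motion = np.zeros_like(x_curr)
    else:
        motion = damping * (x_curr - np.asarray(x_prev, dtype=float))
    return x_curr + motion
```

For a node that moved from (8, 19) to (10, 20), the search in the next frame would start at (11.8, 20.9) rather than at the undamped (12, 21).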

A tracking technique based on a Gaussian image pyramid, 
applied to one dimension, is illustrated in FIG. 19. Instead 
of using the original image resolution, the image is down 
sampled 2-4 times to create a Gaussian pyramid of the 
image. An image pyramid of 4 levels results in a distance of 
24 pixels on the finest, original resolution level being 
represented as only 3 pixels on the coarsest level. Jets may 
be computed and compared at any level of the pyramid. 
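
The pyramid construction can be sketched as follows. This is a minimal illustration, assuming a 5-tap binomial kernel as the Gaussian approximation and edge replication at the borders; neither choice is specified in the source, and the names `downsample` and `build_pyramid` are hypothetical.

```python
import numpy as np


def downsample(image):
    """Blur with a 1-4-6-4-1 binomial kernel, then keep every second pixel."""
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    padded = np.pad(image, 2, mode="edge")
    # Separable filtering: along rows first, then along columns.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)
    return blurred[::2, ::2]


def build_pyramid(image, levels):
    """Return [finest, ..., coarsest]; level 0 is the original image."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid
```

With four levels the coarsest image is subsampled by a factor of 2^3 = 8, which is why a 24 pixel distance at the original resolution shrinks to 3 pixels.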

Tracking of a node on the Gaussian image pyramid is 
generally performed first at the most coarse level and then 
proceeding to the most fine level. A jet is extracted on the 
coarsest Gauss level of the actual image frame I_(n+1) at 
the position X_(n+1) using the damped linear motion 
estimation X_(n+1) = X_n + D_n as described above, and
compared to the corresponding jet computed on the coarsest 
Gauss level of the previous image frame. From these two 
jets, the disparity is determined, i.e., the 2D vector R 
pointing from X_(n+1) to that position that corresponds 
best to the jet from the previous frame. This new position is 
assigned to X_(n+1). The disparity calculation is described 
below in more detail. The position on the next finer Gauss 
level of the actual image (being 2*X_(n+1)), corresponding
to the position X_(n+1) on the coarsest Gauss level is the 
starting point for the disparity computation on this next finer 
level. The jet extracted at this point is compared to the 
corresponding jet calculated on the same Gauss level of the 
previous image frame. This process is repeated for all Gauss 
levels until the finest resolution level is reached, or until the 
Gauss level is reached which is specified for determining the 
position of the node corresponding to the previous frame's 
position. 

Two representative levels of the Gaussian image pyramid
are shown in FIG. 19, a coarser level 194 above, and a finer
level 196 below. Each jet is assumed to have filter responses
for two frequency levels. Starting at position 1 on the coarser
Gauss level, X_(n+1) = X_n + D_n, a first disparity move
using only the lowest frequency jet coefficients leads to
position 2. A second disparity move by using all jet coeffi- 
cients of both frequency levels leads to position 3, the final 
position on this Gauss level. Position 1 on the finer Gauss 
level corresponds to position 3 on the coarser level with the 
coordinates being doubled. The disparity move sequence is 
repeated, and position 3 on the finest Gauss level is the final 
position of the tracked landmark. 
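
The level-by-level procedure can be summarized by the following sketch. It only shows the control flow (refine on a level, then double the coordinates for the next finer level); the per-level disparity move is passed in as a callback. The names `track_coarse_to_fine`, `refine`, and `stop_level` are hypothetical and are not taken from the source.

```python
def track_coarse_to_fine(start_pos, num_levels, refine, stop_level=0):
    """Track one node from the coarsest pyramid level down to stop_level.

    start_pos: predicted position X_(n+1) in coarsest-level coordinates.
    refine(level, pos): returns the position after the disparity move(s)
        on that level (assumed to be supplied by the caller).
    stop_level: finest level to process (0 is the original resolution).
    """
    pos = (float(start_pos[0]), float(start_pos[1]))
    for level in range(num_levels - 1, stop_level - 1, -1):
        pos = refine(level, pos)
        if level > stop_level:
            # Coordinates double when moving to the next finer level.
            pos = (2.0 * pos[0], 2.0 * pos[1])
    return pos
```

The returned position is expressed in the coordinates of `stop_level`, so with the default it is a position in the original image.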

After the new position of the tracked node in the actual 
image frame has been determined, the jets on all Gauss
levels are computed at this position. A stored array of jets
that was computed for the previous frame, representing the 
tracked node, is then replaced by a new array of jets 
computed for the current frame. 

Use of the Gauss image pyramid has two main advantages:
First, movements of nodes are much smaller in terms
of pixels on a coarser level than in the original image, which
makes tracking possible by performing only a local move
instead of an exhaustive search in a large image region.
Second, the computation of jet components is much faster
for lower frequencies, because the computation is performed
with a small kernel window on a down sampled image,
rather than with a large kernel window on the original
resolution image.

Note that the correspondence level may be chosen
dynamically, e.g., in the case of tracking facial features, the
correspondence level may be chosen dependent on the actual
size of the face. Also the size of the Gauss image pyramid
may be altered through the tracking process, i.e., the size
may be increased when motion gets faster, and decreased
when motion gets slower. Typically, the maximal node
movement on the coarsest Gauss level is limited to a range
of 1 to 4 pixels. Also note that the motion estimation is often
performed only on the coarsest level.

The computation of the displacement vector between two
given jets on the same Gauss level (the disparity vector) is
now described. To compute the displacement between two
consecutive frames, a method is used which was originally
developed for disparity estimation in stereo images, based
on D. J. Fleet and A. D. Jepson, "Computation of component
image velocity from local phase information", International
Journal of Computer Vision, volume 5, issue 1, pages
77-104, 1990 and on W. M. Theimer and H. A. Mallot,
"Phase-based binocular vergence control and depth reconstruction
using active vision", CVGIP: Image
Understanding, volume 60, issue 3, pages 343-358, November
1994. The strong variation of the phases of the complex
filter responses is used explicitly to compute the displacement
with subpixel accuracy (See, Wiskott, L., "Labeled
Graphs and Dynamic Link Matching for Face Recognition
and Scene Analysis", Verlag Harri Deutsch, Thun-Frankfurt
am Main, Reihe Physik 53, PhD Thesis, 1995). By writing
the response J_j to the jth Gabor filter in terms of amplitude
a_j and phase φ_j, a similarity function can be defined as

$$ S(J, J', d) = \frac{\sum_j a_j a'_j \cos(\phi_j - \phi'_j - d \cdot k_j)}{\sqrt{\sum_j a_j^2 \, \sum_j a'^2_j}} $$

Let J and J' be two jets at positions X and X' = X+d. The
displacement d may be found by maximizing the similarity
S with respect to d, the k_j being the wavevectors associated
with the filter generating J_j. Because the estimation of d is
only precise for small displacements, i.e., large overlap of
the Gabor jets, large displacement vectors are treated as a
first estimate only, and the process is repeated in the
following manner. First, only the filter responses of the lowest
frequency level are used resulting in a first estimate d_1.
Next, this estimate is executed and the jet J is recomputed at
the position X_1 = X+d_1, which is closer to the position X'
of jet J'. Then, the lowest two frequency levels are used for
the estimation of the displacement d_2, and the jet J is
recomputed at the position X_2 = X_1+d_2. This is iterated
until the highest frequency level used is reached, and
the final disparity d between the two start jets J and J' is
given as the sum d = d_1+d_2+... . Accordingly, displacements
of up to half the wavelength of the kernel with the
lowest frequency may be computed (see Wiskott 1995
supra).
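
To make the maximization concrete, the sketch below evaluates S(J, J', d) on a grid of candidate displacements and keeps the best one. A brute-force grid search is used here only because it is easy to verify; the cited works describe faster estimates, and nothing in the source prescribes this search. All function and parameter names are hypothetical.

```python
import numpy as np


def phase_similarity(a, phi, a2, phi2, k, d):
    """S(J, J', d) for amplitudes a, a2, phases phi, phi2, wavevectors k (n x 2)."""
    num = np.sum(a * a2 * np.cos(phi - phi2 - k @ d))
    return num / np.sqrt(np.sum(a ** 2) * np.sum(a2 ** 2))


def estimate_displacement(a, phi, a2, phi2, k, max_shift=4.0, step=0.25):
    """Return the grid displacement d that maximizes S, and the value of S."""
    shifts = np.arange(-max_shift, max_shift + step, step)
    best_d, best_s = np.zeros(2), -np.inf
    for dx in shifts:
        for dy in shifts:
            d = np.array([dx, dy])
            s = phase_similarity(a, phi, a2, phi2, k, d)
            if s > best_s:
                best_d, best_s = d, s
    return best_d, best_s
```

The search range should stay below half the wavelength of the lowest-frequency kernel, in line with the limit stated above; beyond that the cosine terms become ambiguous. The maximum value of S doubles as the confidence value used for tracking-error detection.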

Although the displacements are determined using floating
point numbers, jets may be extracted (i.e., computed by
convolution) at (integer) pixel positions only, resulting in a
systematic rounding error. To compensate for this subpixel
error Δd, the phases of the complex Gabor filter responses
should be shifted according to

$$ \Delta\phi_j = \Delta d \cdot k_j \qquad (6) $$

so that the jets will appear as if they were extracted at the
correct subpixel position. Accordingly, the Gabor jets may
be tracked with subpixel accuracy without any further
accounting of rounding errors. Note that Gabor jets provide
a substantial advantage in image processing because the
problem of subpixel accuracy is more difficult to address in
most other image processing methods.
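
A minimal sketch of this phase correction follows. It assumes the shift is added to the stored phases and that phases are kept in the interval [-π, π); the sign convention depends on how the jets were extracted, and the name `shift_phases` is hypothetical.

```python
import numpy as np


def shift_phases(phases, wavevectors, delta_d):
    """Shift each phase by delta_d . k_j and wrap the result to [-pi, pi).

    phases: array of n phases; wavevectors: n x 2 array of k_j;
    delta_d: the subpixel rounding error as a 2-vector.
    """
    shifted = phases + wavevectors @ np.asarray(delta_d, dtype=float)
    # Wrap-around keeps the result on the circular phase axis.
    return (shifted + np.pi) % (2.0 * np.pi) - np.pi
```

The explicit wrap in the last line is the readjustment step that the 2-byte integer phase representation described earlier obtains for free from two's complement overflow.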

Tracking error also may be detected by determining
whether a confidence or similarity value is smaller than a
predetermined threshold (block 184 of FIG. 17). The similarity
(or confidence) value S may be calculated to indicate
how well the two image regions in the two image frames
correspond to each other simultaneous with the calculation
of the displacement of a node between consecutive image
frames. Typically, the confidence value is close to 1,
indicating good correspondence. If the confidence value is not
close to 1, either the corresponding point in the image has
not been found (e.g., because the frame rate was too low
compared to the velocity of the moving object), or this
image region has changed so drastically from one image
frame to the next, that the correspondence is no longer well
defined (e.g., for the node tracking the pupil of the eye the
eyelid has been closed). Nodes having a confidence value
below a certain threshold may be switched off.

A tracking error also may be detected when certain
geometrical constraints are violated (block 186). If many
nodes are tracked simultaneously, the geometrical configuration
of the nodes may be checked for consistency. Such
geometrical constraints may be fairly loose, e.g., when facial
features are tracked, the nose must be between the eyes and
the mouth. Alternatively, such geometrical constraints may
be rather accurate, e.g., a model containing the precise shape
information of the tracked face. For intermediate accuracy,
the constraints may be based on a flat plane model. In the flat
plane model, the nodes of the face graph are assumed to be
on a flat plane. For image sequences that start with the
frontal view, the tracked node positions may be compared to
the corresponding node positions of the frontal graph
transformed by an affine transformation to the actual frame. The
6 parameters of the optimal affine transformation are found
by minimizing the least squares error in the node positions.
Deviations between the tracked node positions and the
transformed node positions are compared to a threshold. The
nodes having deviations larger than the threshold are
switched off. The parameters of the affine transformation
may be used to determine the pose and relative scale
(compared to the start graph) simultaneously (block 188).
Thus, this rough flat plane model assures that tracking errors
may not grow beyond a predetermined threshold.
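
The flat plane check can be sketched with an ordinary least-squares fit. The code below is an illustration under stated assumptions: the function names and the Euclidean deviation measure are hypothetical choices, and the threshold value is left to the caller because the source does not give one.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares affine map (6 parameters) taking src points to dst.

    Returns a 3 x 2 matrix P such that [x, y, 1] @ P approximates dst.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    design = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return params


def flag_outliers(frontal, tracked, threshold):
    """Return a boolean mask of nodes to switch off, plus the affine parameters."""
    frontal = np.asarray(frontal, dtype=float)
    tracked = np.asarray(tracked, dtype=float)
    params = fit_affine(frontal, tracked)
    design = np.hstack([frontal, np.ones((len(frontal), 1))])
    deviation = np.linalg.norm(tracked - design @ params, axis=1)
    return deviation > threshold, params
```

The first two rows of the returned parameter matrix hold the linear part of the transformation, from which pose and relative scale could be read off, and the last row holds the translation. Because a plain least-squares fit is itself pulled toward outliers, the threshold has to leave some margin for the inlier nodes.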

If a tracked node is switched off because of a tracking
error, the node may be reactivated at the correct position
(block 190), advantageously using bunch graphs that include
different poses, and tracking is continued from the corrected
position (block 192). After a tracked node has been switched
off, the system may wait until a predefined pose is reached
for which a pose specific bunch graph exists. Otherwise, if
only a frontal bunch graph is stored, the system must wait
until the frontal pose is reached to correct any tracking 
errors. The stored bunch of jets may be compared to the 
image region surrounding the fit position (e.g., from the flat 
plane model), which works in the same manner as tracking, 
except that instead of comparing with the jet of the previous 
image frame, the comparison is repeated with all jets of the 
bunch of examples, and the most similar one is taken. 
Because the facial features are known, e.g., the actual pose, 
scale, and even the rough position, graph matching or an 
exhaustive searching in the image and/or pose space is not 
needed and node tracking correction may be performed in 
real time. 
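
The comparison against a stored bunch can be sketched as below. The source states only that all jets of the bunch are compared and the most similar one is taken; using the amplitude-only similarity (the normalized scalar product of amplitudes defined earlier) is an assumption made for this example, and the function names are hypothetical.

```python
import numpy as np


def amplitude_similarity(a, b):
    """Normalized scalar product of two amplitude vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))


def best_bunch_match(image_jet, bunch):
    """Return (index, similarity) of the bunch jet most similar to image_jet."""
    scores = [amplitude_similarity(image_jet, jet) for jet in bunch]
    best = int(np.argmax(scores))
    return best, scores[best]
```

In a reinitialization step this comparison would be repeated at candidate positions around the fit position, keeping the position with the highest returned similarity.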

For tracking correction, bunch graphs are not needed for 
many different poses and scales because rotation in the 
image plane as well as scale may be taken into account by 
transforming either the local image region or the jets of the 
bunch graph accordingly as shown in FIG. 20. In addition to 
the frontal pose, bunch graphs need to be created only for 
rotations in depth. 

The speed of the reinitialization process may be increased 
by taking advantage of the fact that the identity of the 
tracked person remains the same during an image sequence. 
Accordingly, in an initial learning session, a first sequence of 
the person may be taken with the person exhibiting a full 
repertoire of frontal facial expressions. This first sequence 
may be tracked with high accuracy using the tracking and 
correction scheme described above based on a large gener- 
alized bunch graph that contains knowledge about many 
different persons. This process may be performed offline and 
generates a new personalized bunch graph. The personalized 
bunch graph then may be used for tracking this person at a 
fast rate in real time because the personalized bunch graph 
is much smaller than the larger, generalized bunch graph. 

The speed of the reinitialization process also may be 
increased by using a partial bunch graph reinitialization. A 
partial bunch graph contains only a subset of the nodes of a 
full bunch graph. The subset may be as small as only a single 
node. 

A pose estimation bunch graph makes use of a family of 
two-dimensional bunch graphs defined in the image plane. 
The different graphs within one family account for different 
poses and/or scales of the head. The landmark finding 
process attempts to match each bunch graph from the family 
to the input image in order to determine the pose or size of 
the head in the image. An example of such a pose-estimation
procedure is shown in FIG. 21. The first step of the pose
estimation is equivalent to that of the regular landmark 
finding. The image (block 198) is transformed (blocks 200 
and 202) in order to use the graph similarity functions. Then, 
instead of only one, a family of three bunch graphs is used. 
The first bunch graph contains only the frontal pose faces 
(equivalent to the frontal view described above), and the 
other two bunch graphs contain quarter-rotated faces (one 
representing rotations to the left and one to the right). As 
before, the initial positions for each of the graphs are in the
upper left corner, and the positions of the graphs are scanned
on the image and the position and graph returning the 
highest similarity after the landmark finding is selected 
(blocks 204-214). 

After initial matching for each graph, the similarities of 
the final positions are compared (block 216). The graph that 
best corresponds to the pose given on the image will have 
the highest similarity (block 218). In FIG. 21, the left-rotated 
graph provides the best fit to the image, as indicated by its 
similarity. Depending on resolution and degree of rotation of
the face in the picture, the similarity of the correct graph and
graphs for other poses would vary, becoming very close 
when the face is about half way between the two poses for 
which the graphs have been defined. By creating bunch 
graphs for more poses, a finer pose estimation procedure 
may be implemented that would discriminate between more 
degrees of head rotation and handle rotations in other 
directions (e.g. up or down). 

In order to robustly find a face at an arbitrary distance 
from the camera, a similar approach may be used in which 
two or three different bunch graphs each having different 
scales may be used. The face in the image will be assumed 
to have the same scale as the bunch graph that returns the
highest similarity to the facial image.

A three-dimensional (3D) landmark finding technique
related to the technique described above also may use 
multiple bunch graphs adapted to different poses. However, 
the 3D approach employs only one bunch graph defined in 
3D space. The geometry of the 3D graph reflects an average 
face or head geometry. By extracting jets from images of the 
faces of several persons in different degrees of rotation, a 3D 
bunch graph is created which is analogous to the 2D 
approach. Each jet is now parametrized with the three 
rotation angles. As in the 2D approach, the nodes are located 
at the fiducial points of the head surface. Projections of the 
3D graph are then used in the matching process. One
important generalization of the 3D approach is that every 
node has the attached parameterized family of bunch jets 
adapted to different poses. The second generalization is that 
the graph may undergo Euclidean transformations in 3D 
space and not only transformations in the image plane. 

The 3D graph matching process may be formulated as a 
coarse-to-fine approach that first utilizes graphs with fewer 
nodes and kernels and then in subsequent steps utilizes more 
dense graphs. The coarse-to-fine approach is particularly 
suitable if high precision localization of the feature points in 
certain areas of the face is desired. Thus, computational 
effort is saved by adopting a hierarchical approach in which 
landmark finding is first performed on a coarser resolution, 
and subsequently the adapted graphs are checked at a higher
resolution to analyze certain regions in finer detail. 

Further, the computational workload may be easily split 
on a multi-processor machine such that once the coarse 
regions are found, a few child processes start working in 
parallel each on its own part of the whole image. At the end
of the child processes, the processes communicate the fea- 
ture coordinates that they located to the master process, 
which appropriately scales and combines them to fit back 
into the original image thus considerably reducing the total 
computation time. 

A number of ways have been developed to construct 
texture mapped 3D models of heads. This section describes 
a stereo-based approach. The stereo-based algorithms are 
described for the case of fully calibrated cameras. The 
algorithms perform area based matching of image pixels and 
are suitable in the case that dense 3-D information is needed. 
It then may be used to accurately define a higher object 
description. Further background information regarding stereo
imaging and matching may be found in U. Dhond and J.
Aggarwal, "Structure from Stereo: a Review", IEEE Transactions
on Systems, Man, and Cybernetics, 19(6), pp.
1489-1510, 1989, or more recently in R. Sara and R. Bajcsy,
"On Occluding Contour Artifacts in Stereo Vision", Proc.
Int. Conf. Computer Vision and Pattern Recognition, IEEE
Computer Society, Puerto Rico, 1997; M. Okutomi and T.
Kanade, "Multiple-baseline Stereo", IEEE Trans. on Pattern
Analysis and Machine Intelligence, 15(4), pp. 353-363,
1993; P. Belhumeur, "A Bayesian Approach to Binocular
Stereopsis", Intl. J. of Computer Vision, 19(3), pp.
237-260, 1996; Roy, S. and Cox, I., "Maximum-Flow
Formulation of the N-camera Stereo Correspondence 
Problem", Proc. Int. Conf. Computer Vision, Narosa Pub- 
lishing House, Bombay, India, 1998; Scharstein, D. and 
Szeliski, R., "Stereo Matching with Non-Linear Diffusion", 
Proc. Int. Conf. Computer Vision and Pattern Recognition, 
IEEE Computer Society, San Francisco, Calif., 1996; and 
Tomasi, C. and Manduchi, R., "Stereo without Search", 
Proc. European Conf. Computer Vision, Cambridge, UK, 
1996. 

An important issue in stereoscopy is known as the cor- 
respondence (matching) problem; i.e. to recover range data 
from binocular stereo, the corresponding projections of the 
spatial 3-D points have to be found in the left and right 
images. To reduce the search-space dimension the epipolar 
constraint is applied (See, S. Maybank and O. Faugeras, "A 
Theory of Self-Calibration of a Moving Camera", Intl. J. of 
Computer Vision, 8(2), pp. 123-151, 1992). Stereoscopy can
be formulated in a four-step process: 

Calibration: compute the camera's parameters. 
Rectification: the stereo-pair is projected, so that corresponding
features in the images lie on the same lines.
These lines are called epipolar lines. This is not abso- 
lutely needed but greatly improves the performance of 
the algorithm, as the matching process can be 
performed, as a one-dimensional search, along hori- 
zontal lines in the rectified images. 
Matching: a cost function is locally computed for each 
position in a search window. Maximum of correlation 
is used to select corresponding pixels in the stereo pair. 
Reconstruction: 3-D coordinates are computed from 
matched pixel coordinates in the stereo pair. Post- 
processing may be added right after the matching in 
order to remove matching errors. Possible errors result 
from matching ambiguities mostly due to the fact that 
the matching is done locally. Several geometric con- 
straints as well as filtering may be applied to reduce the 
number of false matches. When dealing with continu- 
ous surfaces (a face in frontal position for instance) 
interpolation may also be used to recover non-matched 
areas (mostly non-textured areas where the correlation 
score does not have a clear monomodal maximum). 
The formalism leading to the equations used in the 
rectification and in the reconstruction process is called 
projective geometry and is presented in detail in O.
Faugeras, "Three-Dimensional Computer Vision, A Geometric
Viewpoint", MIT Press, Cambridge, Massachusetts,
1993. The model used provides significant advantages. 
Generally, a simple pinhole camera model, shown in FIG. 
22, is assumed. If needed, lens distortion can also be 
computed at calibration time (the most important factor 
being the radial lens distortion). From a practical point of 
view the calibration is done using a calibration aid, i.e. an 
object with known 3-D structure. Usually, a cube with 
visible dots or a squared pattern is used as a calibration aid 
as shown in FIG. 23. 

To simplify the rectification algorithms, the input images
of each stereo pair are first rectified (see, N. Ayache and C.
Hansen, "Rectification of Images for Binocular and Trinocular
Stereovision", Proc. of 9th International Conference on
Pattern Recognition, 1, pp. 11-16, Italy, 1988), so that
corresponding points lie on the same image lines. Then, by
definition, corresponding points have coordinates (u_L, v_L)
and (u_L-d, v_L), in left and right rectified images, where "d"
is known as the disparity. For details on the rectification
process refer to Faugeras, supra. The choice of the rectifying
plane (plane used to project the images to obtain the rectified
images) is important. Usually this plane is chosen to minimize
the distortion of the projected images, and such that
corresponding pixels are located along the same line number
(epipolar lines are parallel and aligned) as shown in FIG. 24.
Such a configuration is called standard geometry.

With reference to FIG. 26, matching is the process of
finding corresponding points in left and right images.
Several correlation functions may be used to measure this
disparity; for instance the normalized cross-correlation (see,
H. Moravec, "Robot Rover Visual Navigation", Computer
Science: Artificial Intelligence, pp. 13-15, 105-108, UMI
Research Press 1980/1981) is given by:

$$ c(I_L, I_R) = \frac{2\,\mathrm{cov}(I_L, I_R)}{\mathrm{var}(I_L) + \mathrm{var}(I_R)} \qquad (6) $$

where I_L and I_R are the left and right rectified images. The
correlation function is applied on a rectangular area at points
(u_L, v_L) and (u_R, v_R). The cost function c(I_L, I_R) is computed,
as shown in FIG. 25, for the search window that is of size
1xN (because of the rectification process), where N is some
admissible integer. For each pixel (u_L, v_L) in the left image,
the matching produces a correlation profile c(u_L, v_L, d)
where "d" is defined as the disparity at the point (u_L, v_L), i.e.:



dL-0 



The second equation expresses the fact that epipolar lines 
are aligned. As a result the matching procedure outputs a 
disparity map, or an image of disparities that can be superimposed
on a base image (here the left image of the stereo
pair). The disparity map tells "how much to move along the
epipolar line to find the correspondent of the pixel in the right
image of the stereo pair". 
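
The matching step for a single pixel can be sketched as follows. The correlation score follows the formula above; the window size, the disparity range, and the convention that the right-image match sits at (u_L-d, v_L) with d >= 0 are assumptions of this example, and the function names are hypothetical.

```python
import numpy as np


def correlation(patch_l, patch_r):
    """c = 2 cov(I_L, I_R) / (var(I_L) + var(I_R)) over two equal-size windows."""
    l = patch_l.ravel() - patch_l.mean()
    r = patch_r.ravel() - patch_r.mean()
    denom = np.dot(l, l) + np.dot(r, r)
    return 2.0 * np.dot(l, r) / denom if denom > 0 else 0.0


def match_pixel(left, right, u, v, half=2, max_disp=8):
    """Search along row v of the rectified pair; return (disparity, profile)."""
    win_l = left[v - half:v + half + 1, u - half:u + half + 1]
    profile = []
    for d in range(max_disp + 1):
        ur = u - d
        if ur - half < 0:
            break  # the window would leave the right image
        win_r = right[v - half:v + half + 1, ur - half:ur + half + 1]
        profile.append(correlation(win_l, win_r))
    return int(np.argmax(profile)), profile
```

Running `match_pixel` for every pixel of the left image yields the integer disparity map; the returned profile is the correlation profile c(u_L, v_L, d) for that pixel.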

Several refinements may be used at matching time. For
instance a list of possible correspondents can be kept at each
point and constraints such as the visibility constraint, ordering
constraint, and disparity gradient constraint (see, A.
Yuille and T. Poggio, "A Generalized Ordering Constraint
for Stereo Correspondence", MIT, Artificial Intelligence
Laboratory Memo, No. 777, 1984; Dhond et al., supra; and
Faugeras, supra) can be used to remove impossible configurations
(see, R. Sara et al., 1997, supra). One can also use
cross-matching, where the matching is performed from left
to right then from right to left, and a candidate (correlation
peak) is accepted if both matches lead to the same image
pixel, i.e. if,



(9) 



where d_LR is the disparity found matching left to right and
d_RL right to left. Moreover a pyramidal strategy can be used to
help the whole matching process by restraining the search
window. This is implemented by carrying out the matching at each
level of a pyramid of resolution, using the estimation of the
preceding level. Note that a hierarchical scheme also enforces
surface continuity.

Note that when stereo is used for 2-D segmentation 
purposes, only the disparity map is needed. One can then 
avoid using the calibration process described previously, and 
use a result of projective geometry (see, Q. T. Luong, 
"Fundamental Matrix and autocalibration in Computer 
Vision", Ph.D. Thesis, University of Paris Sud, Orsay, 
France, December 1992) showing that rectification can be 
achieved if the Fundamental Matrix is available. The fundamental
matrix can be used in turn to rectify the images, so
that matching can be carried out as described previously.

To refine the 3-D position estimates, a subpixel correction
of the integer disparity map is computed which results in a
subpixel disparity map. The subpixel disparity can be
obtained either

using a second order interpolation of the correlation
scores around the detected maximum, or
using a more general approach as described in F.
Devernay, "Computing Differential Properties of
3-D Shapes from Stereoscopic Images without
3-D Models", INRIA, RR-2304, Sophia Antipolis,
1994 (which takes into account the distortion between
left and right correlation windows, induced by the
perspective projection, assuming that a planar patch of
surface is imaged).
The first approach is the fastest while the second approach
gives more reliable estimations of the subpixel disparity. To
achieve fast subpixel estimation, while preserving the accuracy
of the estimation, we proceed as follows. Let I_L and I_R
be the left and the right rectified images. Let ε be the
unknown subpixel correction, and A(u, v) be the transformation
that maps the correlation window from the left to the
right image (for a planar target it is an affine mapping that
preserves image rows). For corresponding pixels in the left
and right images,

$$ I_R(u_L - d + \epsilon, v_L) = \alpha \, I_L(A(u_L, v_L)) \qquad (10) $$

where the coefficient α takes into account possible differences
in camera gains. A first order linear approximation of
the above formula with respect to 'ε' and 'A' gives a linear system
where each coefficient is estimated over the corresponding
left and right correlation windows. A least-squares solution
of this linear system provides the subpixel correction.
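
For comparison, the simpler of the two listed options, second order interpolation of the correlation scores around the detected maximum, amounts to fitting a parabola through three neighbouring scores. The sketch below shows that calculation; the function name and the fallback to a zero offset for a degenerate peak are choices made for this example.

```python
def subpixel_peak(c_minus, c_zero, c_plus):
    """Offset of the parabola vertex through the scores at d-1, d and d+1.

    The subpixel disparity is the integer disparity d plus the returned
    offset, which lies in [-0.5, 0.5] when c_zero is the largest score.
    """
    curvature = c_minus - 2.0 * c_zero + c_plus
    if curvature >= 0.0:
        # Flat or non-peaked profile: no meaningful refinement.
        return 0.0
    return 0.5 * (c_minus - c_plus) / curvature
```

For scores sampled from the parabola 1 - (x - 0.3)^2 at x = -1, 0, 1, the function recovers the offset 0.3.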
Note that when a continuous surface is to be
recovered (as for a face in frontal pose), an interpolation
scheme can be used on the filtered disparity map. Such a
scheme can be derived from the following considerations.
As we suppose the underlying surface to be continuous, the
interpolated and smoothed disparity map d has to satisfy the
following equation:



\min \left\{ \iint \left[ (d' - d)^2 + \lambda (\nabla d)^2 \right] du \, dv \right\}    (13)



where λ is a smoothing parameter and the integration is
taken over the image (for pixel coordinates u and v). An
iterative algorithm is straightforwardly obtained using Euler
equations, and using an approximation of the Laplacian
operator ∇².
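
One possible form of such an iterative algorithm is sketched below, under stated assumptions: d' is taken to be the filtered input map, d the smoothed unknown, the Laplacian is approximated with the four nearest neighbours, and pixels without a valid disparity are filled from the smoothness term alone. The function name, the validity mask and the default parameters are illustrative choices.

```python
import numpy as np


def smooth_disparity(d_obs, valid, lam=1.0, iterations=200):
    """Jacobi iteration for the Euler equation (d - d') - lam * Laplacian(d) = 0.

    d_obs : filtered disparity map d' (values at invalid pixels are ignored)
    valid : boolean mask, True where d_obs holds a measured disparity
    lam   : smoothing parameter (lambda)
    """
    d_obs = np.asarray(d_obs, dtype=float)
    valid = np.asarray(valid, dtype=bool)

    # Start from the data, with holes set to the mean valid disparity.
    d = np.where(valid, d_obs, d_obs[valid].mean())

    for _ in range(iterations):
        # Sum of the four neighbours, replicating the border values.
        p = np.pad(d, 1, mode="edge")
        neighbours = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]

        # Valid pixels balance the data term and the smoothness term.
        data_update = (d_obs + lam * neighbours) / (1.0 + 4.0 * lam)
        # Holes are interpolated: plain average of the neighbours.
        fill_update = neighbours / 4.0

        d = np.where(valid, data_update, fill_update)
    return d
```

A larger `lam` gives a smoother surface at the cost of fidelity to the measured disparities.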

From the disparity map and the camera calibration, the
spatial positions of the 3-D points are computed by
triangulation (see Dhond et al., supra). The result of the
reconstruction (from a single stereo pair of images) is a list
of spatial points.
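
For the simplest calibration case the triangulation reduces to a closed form. The sketch below assumes rectified, parallel pinhole cameras with a shared focal length in pixels, a known baseline, a principal point (cu, cv), and positive disparities; all parameter names are hypothetical, and a general calibration would need the full projection matrices.

```python
import numpy as np


def triangulate(disparity, focal_px, baseline, cu, cv):
    """Convert a disparity map to a list of 3-D points.

    Assumes rectified parallel cameras, so that depth is
    Z = focal_px * baseline / disparity.
    Pixels with non-positive disparity are skipped.
    Returns an (N, 3) array of (X, Y, Z) in the units of baseline.
    """
    disparity = np.asarray(disparity, dtype=float)
    v, u = np.indices(disparity.shape)  # row and column coordinates
    ok = disparity > 0

    z = focal_px * baseline / disparity[ok]
    x = (u[ok] - cu) * z / focal_px
    y = (v[ok] - cv) * z / focal_px
    return np.column_stack([x, y, z])
```

The returned array corresponds to the list of spatial points mentioned above, one row per pixel with a valid disparity.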

In the case several images are used (polynocular stereo), a
verification step may be used (see R. Sara, "Reconstruction
of 3-D Geometry and Topology from Polynocular Stereo",
http://cmp.felk.cvut.cz/-sara). During this procedure, the set
of reconstructed points, from all stereo pairs, is re-projected
back to the disparity space of all camera pairs, and it is
verified whether the projected points match their predicted
position in the other image of each of the pairs. It appears
that the verification eliminates outliers (especially the
artifacts of matching near occlusions) very effectively.

FIG. 26 shows a typical result of applying a stereo
algorithm to a stereo pair of images obtained by projecting
textured light. The top row of FIG. 26 shows the left, right,
and a color image taken in a short time interval, ensuring that
the subject did not move. The bottom row shows two views
of the reconstructed face model obtained by applying stereo to
the textured images, and texture mapped with the color
image. Note that interpolation and filtering have been applied
to the disparity map, so that the reconstruction over the face
is smooth and continuous. Note also that the result is
displayed as the raw set of points obtained from the stereo;
these points can be meshed together to obtain a continuous
surface, for instance using the algorithm [...]

positions can be compared with the jets extracted from stored
gallery images. Either complete graphs are compared, as is the
case for face recognition applications, or just partial graphs
or even individual nodes.

Before the jets are extracted for the actual comparison, a
number of image normalizations are applied. One such
normalization is called background suppression. The influence
of the background on probe images needs to be suppressed
because different backgrounds between probe and gallery
images lower similarities and frequently lead to
misclassifications. Therefore we take nodes and edges
surrounding the face as face boundaries. Background pixels get
smoothly toned down when deviating from the face. Each
pixel value outside of the head is modified as follows:

p_{new} = p_{old} \cdot \lambda + c \cdot (1 - \lambda)    (12)

where

\lambda = \exp(-d / d_0)

and c is a constant background gray value, and d is the
Euclidean distance of the pixel position from the closest
edge of the graph. d_0 is a constant tone down value. Of
course, other functional dependencies between pixel value
and distance from the graph boundaries are possible.

As shown in FIG. 28, the automatic background suppression
drags the gray value smoothly to the constant when
deviating from the closest edge. This method still leaves a
background region surrounding the face visible, but it avoids
strong disturbing edges in the image, which would occur if
this region was simply filled up with a constant gray value.

While the foregoing has been with reference to specific
embodiments of the invention, it will be appreciated by
those skilled in the art that these are illustrations only and
that changes in these embodiments can be made without
departing from the principles of the invention, the scope of
which is defined by the appended claims.

What is claimed is:

1. A process for recognizing objects in an image frame,
comprising steps for:

detecting an object in the image frame and bounding a
portion of the image frame associated with the object
resulting in a bound portion of the image frame that is
associated with the object and an unbound portion of
the image frame that is not associated with the object;

transforming only the bound portion and not the unbound
portion of the image frame using a wavelet
transformation to generate a transformed image;

locating, on the transformed image, nodes associated with
distinguishing features of the object defined by wavelet
jets of a bunch graph generated from a plurality of
representative object images;

identifying the object based on a similarity between
wavelet jets associated with an object image in a
gallery of object images and wavelet jets at the nodes
on the transformed image.

2. A process for recognizing objects as defined in claim 1,
further comprising sizing and centering the detected object
within the bound portion of the image such that the detected
object has a predetermined size and location within the
bound portion.

3. A process for recognizing objects as defined in claim 1,
further comprising a step for suppressing a background
portion, of the bound portion of the image frame, that is not
representing the object, prior to identifying the object.

4. A process for recognizing objects as defined in claim 3,
wherein the suppressed background portion is gradually
suppressed near edges of the object in the bound portion of
the image frame.

5. A process for recognizing objects as defined in claim 1,
wherein the object is a head of a person exhibiting a facial
region.

6. A process for recognizing objects as defined in claim 1,
wherein the bunch graph is based on a three-dimensional
representation of the object.

7. A process for recognizing objects as defined in claim 1,
wherein the wavelet transformation is performed using
phase calculations that are performed using a hardware
adapted phase representation.

8. A process for recognizing objects as defined in claim
1, wherein the locating step is performed using a
coarse-to-fine approach.

9. A process for recognizing objects as defined in claim 1,
wherein the bunch graph is based on predetermined poses.

10. A process for recognizing objects as defined in claim
1, wherein the identifying step uses a three-dimensional
representation of the object.

11. A process for recognizing objects as defined in claim
1, wherein the bound portion covers less than ten percent of
the image frame.

12. A process for recognizing objects as defined in claim
1, wherein the step for detecting the object includes
detecting a color associated with the object.

13. A process for recognizing objects in a sequence of
image frames, comprising:

detecting an object in the image frames and bounding a
portion of each image frame associated with the object;

transforming the bound portion of each image frame using
a wavelet transformation to generate a transformed
image;

locating, on the transformed images, nodes associated
with distinguishing features of the object defined by
wavelet jets of a bunch graph generated from a plurality
of representative object images;

identifying the object based on a similarity between
wavelet jets associated with an object image in a
gallery of object images and wavelet jets at the nodes
on the transformed images.

14. A process for recognizing objects as defined in claim
13, wherein the step of detecting an object further comprises
tracking the object between image frames based on a
trajectory associated with the object.

15. A process for recognizing objects as defined in claim
13, further comprising a preselecting process that chooses a
most suitable view of an object out of a sequence of views
that belong to a particular trajectory.

16. A process for recognizing objects as defined in claim
13, wherein the step of locating the nodes includes tracking
the nodes between image frames.

17. A process for recognizing objects as defined in claim
16, further comprising reinitializing a tracked node if the
node's position deviates beyond a predetermined position
constraint between image frames.

18. A process for recognizing objects as defined in claim
17, wherein the predetermined position constraint is based
on a geometrical position constraint associated with relative
positions between the node locations.

19. A process for recognizing objects as defined in claim 
13, wherein the image frames are stereo images and the step 
of detecting includes generating a disparity histogram and a
silhouette image to detect the object. 

20. A process for recognizing objects as defined in claim 
19, wherein the disparity histogram and silhouette image 
generate convex regions which are associated with head 
movement and which are detected by a convex detector.

21. A process for recognizing objects as defined in claim 
13, wherein the wavelet transformations are performed 
using phase calculations that are performed using a hard- 
ware adapted phase representation. 

22. A process for recognizing objects as defined in claim
13, wherein the bunch graph is based on a three-dimensional 
representation of the object. 

23. A process for recognizing objects as defined in claim 
13, wherein the locating step is performed using a
coarse-to-fine approach.

24. A process for recognizing objects as defined in claim 
13, wherein the bunch graph is based on predetermined 
poses. 

25. A process for recognizing objects as defined in claim 
13, wherein the bound portion covers less than ten percent
of the image frame. 

26. A process for recognizing objects as defined in claim 
13, wherein the step for detecting the object includes detect- 
ing a color associated with the object. 

27. A process for recognizing objects as defined in claim
13, wherein the step for detecting the object includes detect- 
ing movement associated with the object in the sequence of 
image frames. 

28. Apparatus for recognizing objects in an image,
comprising:

means for detecting an object in the image frame and
bounding a portion of the image frame associated with
the object resulting in a bound portion of the image
frame that is associated with the object and an unbound
portion of the image frame that is not associated with
the object;

means for transforming only the bound portion and not
the unbound portion of the image frame using a wavelet
transformation to generate a transformed image;

means for locating, on the transformed image, nodes
associated with distinguishing features of the object
defined by wavelet jets of a bunch graph generated
from a plurality of representative object images;

means for identifying the object based on a similarity
between wavelet jets associated with an object image in
a gallery of object images and wavelet jets at the nodes
on the transformed image.

29. Apparatus for recognizing objects as defined in claim
28, wherein the bound portion covers less than ten percent
of the image frame.

30. Apparatus for recognizing objects as defined in claim
28, wherein the means for detecting the object includes a
neural network.

31. Apparatus for recognizing objects as defined in claim
28, wherein the means for detecting the object includes
means for detecting a color associated with the object.

32. Apparatus for recognizing objects as defined in claim
28, further comprising means for suppressing a background
portion, of the bound portion of the image frame, that is not
representing the object, prior to identifying the object.

33. Apparatus for recognizing objects as defined in claim
32, wherein the suppressed background portion is gradually
suppressed near edges of the object in the bound portion of
the image frame.

34. Apparatus for recognizing objects in a sequence of
image frames, comprising:

means for detecting an object in the image frames and
bounding a portion of each image frame associated
with the object resulting in a bound portion of each
image frame that is associated with the object and an
unbound portion of each image frame that is not
associated with the object;

means for transforming only the bound portion and not
the unbound portion of each image frame using a
wavelet transformation to generate a transformed
image;

means for locating, on the transformed images, nodes
associated with distinguishing features of the object
defined by wavelet jets of a bunch graph generated
from a plurality of representative object images;

means for identifying the object based on a similarity
between wavelet jets associated with an object image in
a gallery of object images and wavelet jets at the nodes
on the transformed images.

35. Apparatus for recognizing objects as defined in claim
34, wherein the bound portion covers less than ten percent
of each image frame.

36. Apparatus for recognizing objects as defined in claim
34, wherein the means for detecting the object includes a
neural network.

37. Apparatus for recognizing objects as defined in claim
34, wherein the means for detecting the object includes
means for detecting a color associated with the object.

38. Apparatus for recognizing objects as defined in claim
34, wherein the means for detecting an object further
comprises means for tracking the object between image frames
based on a trajectory associated with the object.

39. Apparatus for recognizing objects as defined in claim
34, wherein the means for locating the nodes includes means
for tracking the nodes between image frames.

40. Apparatus for recognizing objects as defined in claim
39, further comprising means for reinitializing a tracked
node if the node's position deviates beyond a predetermined
position constraint between image frames.

41. Apparatus for recognizing objects as defined in claim
40, wherein the predetermined position constraint is based
on a geometrical position constraint associated with relative
positions between the node locations.

42. Apparatus for recognizing objects as defined in claim
34, wherein the means for transforming uses a hardware
adapted phase representation for performing phase
calculations.